FIT5202 - Data Processing for Big Data

Introduction to Python for FIT5202 Activities

In this activity, we will learn and practice some of the useful syntax and functionalities in Python that will help you to go through activities for FIT5202 throughout the semester. Python is a popular, powerful open-source programming language for building data science applications.

Throughout the semester, we will use Jupyter Notebook, which provides an interactive environment for writing and running code (e.g. Python and R). Each notebook is associated with a single kernel. As we will be using Python, this notebook is associated with the IPython kernel. Before you run your Python code, make sure that the "Python 3" kernel is selected (the kernel name is shown in the upper-right corner of the notebook), and write your code in "Code" cells, not "Markdown" cells. Markdown is a popular markup language that is a superset of HTML.

Python is an interpreted programming language. Thus, you can simply write commands directly into the interpreter and execute them. We will be using Python 3. The following is the plan of our tasks to be done this week:

  • Basic Python syntax and commands
  • Basic data structures
  • Defining classes and functions
  • Usage of the loop statements
  • Useful built-in functions and defining lambda functions
  • Multi-processing functions

Let's get started!

What you need to remember:

  • Run your cells using SHIFT+ENTER (or "Run cell")
  • Run the current cell and insert a new cell below: ALT+ENTER
  • To see more commands, please click the "menu" option (e.g. "Insert", "Cell")
  • To see more keyboard shortcuts, click the above "keyboard image" button. Use "Esc" to enter command mode. Then, you can use a command. Some of the popular shortcuts are
    • Basic navigation: enter, shift-enter, up/k, down/j
    • Saving the notebook: s
    • Cell creation: a (above), b (below)

1. Basic Python Syntax

Import

We can import a module using the import command. For example, to import the numpy module, we can use the following code: import numpy as np. Here, "as" gives the module a shorter alias (np) that we can use in our code.
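
As a small sketch of the common import forms, the example below uses the standard-library math module (chosen here only for illustration, since it is always available) and imports it whole, under an alias, and by a single name:

```python
import math                 # import the whole module
import math as m            # the same module under a shorter alias
from math import sqrt       # import a single name from the module

whole = math.sqrt(16)       # access through the module name
aliased = m.sqrt(16)        # access through the alias
direct = sqrt(16)           # use the imported name directly
print(whole, aliased, direct)
```

All three calls reach the same function, so the choice is mostly about readability.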

Print

Let's first print a string, "Hello World!". In Python 3, print is a function, so its argument must be enclosed in parentheses:

print("Hello World!")

Execute this command in a new cell (i.e. add a new cell below).

Defining variables and assigning them values

In Python, we can define variables without declaring a type. Run the code below in the next cell:

number = 3.14
message = "this is a string message"
print(message, number - number/10)

Do you understand what the above formula produces?
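
The following sketch illustrates what "without a type" means in practice. The variable name value is only an example: the name itself has no fixed type, and the type belongs to whatever object it currently refers to.

```python
value = 3.14
first_type = type(value).__name__    # type of the object currently bound to value
value = "now a string"               # the same name can be rebound to another type
second_type = type(value).__name__
print(first_type, second_type)
```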

Comments

Note that we can use the "#" character to add a comment to a Python script. The rest of the line after "#" is ignored. Run each of the following lines and see the result:

# query1 = "this is a comment"
print(query1)  # raises NameError: the line above is a comment, so query1 was never defined
query2 = "# this is not a comment"
print(query2)  # a "#" inside a string is part of the string

Using arithmetic operations

We can use Python as a simple calculator. Type an expression and the Python interpreter immediately displays the result. Run each line of the code below:

1+2
10 - 2*3
10/5
5/2

Note that division with "/" in Python 3 always returns a float. To discard the fractional part (floor division), use the "//" operator, which returns an integer when both operands are integers. Run each line of the code:

11/3
11//3

To calculate the remainder, use "%". To calculate powers, use "**". Run each line of the code:

11%3
2**5

Python also provides useful built-in functions for arithmetic operations, such as round(). For more details, see: https://docs.python.org/3.8/library/functions.html#round
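
As a brief illustration of such built-in functions, the sketch below uses divmod(), round(), and pow(), which mirror the operators introduced above. The variable names are only examples.

```python
quotient, remainder = divmod(11, 3)   # same as (11 // 3, 11 % 3)
rounded = round(11 / 3, 2)            # round to two decimal places
power = pow(2, 5)                     # same as 2 ** 5
print(quotient, remainder, rounded, power)
```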

Comparison and Logical Operators

The following explains the comparison operators with their usage. A comparison operator is used to compare one operand to another, and returns either True or False.

  • ==: Equality operator. Returns True if the two operands are equal.
  • !=: Inequality Operator. Returns True if the two operands are not equal.
  • <: Less than. Returns True if the left-side operand is less than the right-side operand.
  • Can you guess how these work? >, <=, >=

Run each line of the following code and see the result:

1 == 2
1 != 2
1 < 2
1 <= 2
1 > 2
1 >= 2
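
Comparisons are often combined with the logical operators and, or, and not. The sketch below is a minimal illustration with an example variable x; it also shows that Python allows comparisons to be chained.

```python
x = 5
in_range = (x > 1) and (x < 10)   # True only if both comparisons are True
outside = (x < 1) or (x > 10)     # True if at least one comparison is True
negated = not (x == 5)            # inverts the result of the comparison
chained = 1 < x < 10              # shorthand for (1 < x) and (x < 10)
print(in_range, outside, negated, chained)
```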

Basic Data Structure

We will now learn about the basic data structures in Python. In particular, we will look into the following:

  • string
  • list
  • set
  • dictionary
  • NumPy

String

Python easily handles strings. They can be enclosed in single quotes ('...') or double quotes ("...") with the same result. The character "\" can be used to escape quotes.

Run the code below:

"hello world"
"\"Yes\" he said"
'"Yes" he said'

String literals can span multiple lines by using triple-quotes: """...""" or '''...'''

print("""hello
world""")

Strings can be concatenated using "+". Run the code:

"Py" + "thon"

Also, strings can be repeated using "*". For example, to repeat "a" three times, run the code:

3 * "a"

We can also index strings. The first character has index 0. Run the code:

word = "python"
word[0] 
word[5]

Also, we can slice a string to obtain a substring. Run the code:

word[0:2]

What does the result look like? When slicing, remember that the first index is included and the second index is excluded.

If we omit the first index, it defaults to 0. If we omit the second index, it defaults to the length of the string. Run each line of the code:

word[:2]
word[3:]
word[-2:]

Note that the last example uses a negative index, which counts from the end of the string. It extracts the characters from the second-last to the end.
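
The slicing rules above can be checked with a short sketch. The variable names are only examples; the last line illustrates that a slice and its complement always rebuild the original string.

```python
word = "python"
first_two = word[:2]           # indexes 0 and 1
from_third = word[3:]          # index 3 up to the end
last_two = word[-2:]           # second-last character up to the end
last_char = word[-1]           # negative indexes count from the end
rebuilt = word[:2] + word[2:]  # the two halves join back into the original
print(first_two, from_third, last_two, last_char, rebuilt)
```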

Lists

One of the compound data types is the list. A list contains comma-separated items between square brackets. Run the code:

myList = [1, 3, 5, 7]
myList

Like strings, lists can be indexed and sliced. Run each line of the code:

myList[0]
myList[-1] # the last element
myList[1:]
myList[:]

Also, we can concatenate lists. Run the code:

myList = myList + [9, 11]

Do you want to change an item in the list? Run the code:

myList[1] = 4

Check the elements of myList.

We can also add new items at the end of the list using the append() method. Run the code:

myList.append(13)

The length of the list can be obtained using a built-in function: len(). Print the length of myList!

A powerful feature of lists is that they can be nested, i.e. a list can contain other lists. Run the code:

A = ['a', 'b', 'c']
B = [1, 2, 3]
C = [A, B]
C
C[0]
C[0][1]

Let's see how to delete an item in a list. In the above A, if we want to delete an item 'a', run the code:

A.remove('a')

Alternatively, we can delete an item using its index. Run the code:

del A[0]

Also, we can get the index of an item. Run the code:

A = ['a', 'b', 'c']
A.index('b')

Sets

A set is an unordered collection that does not allow duplicate elements. Using a set, we can easily apply set operations such as union, intersection, and difference. Curly braces or the set() function can be used to create sets. Run the code:

A = {'a', 'b', 'c'}
A
A = {'a', 'b', 'c', 'c'}
A

Check the membership of an item in the set. Run the code:

'a' in A

Now, let's see how to use the set operations. Run the code:

A = {'a', 'b', 'c'}
B = {'a', 'd'}
A.union(B)         # or A | B
A.intersection(B)  # or A & B
A.difference(B)    # or A - B

Now let's see the difference between union and update. The update method changes the set in place, while the union method leaves the original set unchanged and returns a new set instead. Run the code:

AA = {'a', 'b'}
BB = {'c'}
AA.update(BB)
print(AA)
AA = {'a', 'b'}
BB = {'c'}
AA.union(BB)
print(AA)
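
The difference is easiest to see by capturing what each method returns. In this sketch the names combined, before, and result are only examples.

```python
AA = {'a', 'b'}
BB = {'c'}

combined = AA.union(BB)    # a new set; AA itself is not modified
before = set(AA)           # copy of AA taken before update() is called
result = AA.update(BB)     # modifies AA in place and returns None

print(combined, before, result, AA)
```

Because update() returns None, writing AA = AA.update(BB) would lose the set, which is a common mistake.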

Let's add an item to a set using the add() method. Run the code:

A.add('d')
A

Dictionaries

Another useful data structure is the dictionary. In a dictionary, items are indexed by keys. A key can be any immutable type, such as a string or a number. Simply put, a dictionary is a collection of key-value pairs in which the keys must be unique (since Python 3.7, dictionaries preserve insertion order). A pair of braces creates an empty dictionary (i.e. {}).

The following examples show different ways to construct a dictionary. To know more about them, run the code:

contact = {}
contact['a'] = 1
contact
contact2 = {'a':1, 'b':2}
contact2
contact3 = dict(a=1, b=2)
contact3

Let's retrieve all the keys. Run the code:

keys = contact2.keys()
keys

Also, if we want to create a list consisting of the keys of a dictionary, run the following:

list(contact2.keys())

If you want to check whether an item (e.g. 'a') is a key in the dictionary or not, run the code:

'a' in contact2
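
A few more everyday dictionary operations are sketched below. The get() and items() methods are standard dictionary methods that the text above does not cover, and the key 'z' is just an example of a missing key.

```python
contact2 = {'a': 1, 'b': 2}

value_a = contact2['a']           # look up a value by its key
missing = contact2.get('z', 0)    # get() returns a default instead of raising KeyError
contact2['c'] = 3                 # assigning to a new key adds a pair
pairs = list(contact2.items())    # (key, value) tuples in insertion order

print(value_a, missing, pairs)
```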

NumPy Array

NumPy is the core library for scientific computing in Python. It provides a high-performance multi-dimensional array object. We will only start using the NumPy library a little later in this unit (in Week 5). For more information about NumPy, please refer to the following: https://docs.scipy.org/doc/numpy/reference/.

To use this library, let's first import it:

import numpy as np

We can create a numpy 1-dim array as:

a = np.array([1,2,3]) # 1-dim array
print (type(a)) # check the type of the array
print (a.shape)
print (a[0], a[1], a[2])
print(a)

Let's create a 2-dim array:

b = np.array([[1,2,3],[4,5,6]])
print(b.shape)
print(b[0,0])

NumPy also provides many functions to create arrays and manipulate array indexes.

For example, to create a 2x2 array of all zeros, run the code:

c = np.zeros((2,2))
print(c)

Basic math functions can be used as follows:

x = np.array([[1,2], [3,4]])
y = np.array([[5,6], [7,8]])
print (x+y) # elementwise sum producing an array
print(np.add(x,y)) # what about this?
print (x-y) # elementwise differences producing an array
print(np.subtract(x,y)) # what about this?
print (x*y) # elementwise product producing an array
print(np.multiply(x,y)) # what about this?
print (x/y) # elementwise quotient producing an array
print(np.divide(x,y)) # what about this?
print(np.sqrt(x)) # elementwise square root producing an array

A useful function that performs computations on arrays is sum(). Here are some examples. Run the code:

x = np.array([[1,2], [3,4]])
print(np.sum(x)) # sum of all the elements
print(np.sum(x, axis=0)) # sum of each column
print(np.sum(x, axis=1)) # sum of each row

Defining Classes and Functions

Python is a powerful object-oriented programming (OOP) language. In OOP, we try to create reusable patterns of code. One important concept in OOP is the distinction between classes and objects:

  • Class: A concept or prototype for an object, defining a set of attributes and functions that characterise an object instantiated from this class.
  • Object: An instance of a class.

We can define a class by using the class keyword. Similarly, we can define its functions by using the def keyword. Let's create a class using the following code:

class myClass:
    name = "A"
    def myFunc (self):
        return "hello world!"

In the myClass class, the function myFunc is also called a method. A method is a function defined within a class. Note that its argument is self, which is a reference to the object on which the method is called. To reference instances (or objects) of the class, self is always the first parameter of a method.

Defining this class did not create any myClass objects. Instead, we need to create an object using the class. Now we create an object that is an instance of the myClass class:

x = myClass()

Then, we can use its method(s) using the dot operator:

x.myFunc()

The constructor method is used to initialise data in a class. It runs as soon as an object of the class is instantiated. Also known as the __init__ method, it is conventionally the first method defined in a class and looks like this:

def __init__(self):
    print("This is the constructor method.")

Add this method to the class, and then run the code:

x = myClass()
x.myFunc()
Can you see how the constructor works?

Now, let's create a method that uses a variable, name, that we will add to the class. We will assign it a value in a method. Our new class will look like this:

class myClass:
    name = "" # we will use this variable

    def __init__(self):
        print("This is the constructor method.")

    def assignName(self, name):
        self.name = name

    def myFunc (self):
        return "hi " + self.name + ": hello world!"

Then, run the following code:

x = myClass()
x.assignName("YB")
x.myFunc()

We learned how to create classes, instantiate objects, initialise attributes with the constructor method, and work with objects of a class. OOP is an important concept because it makes reusing code more straightforward and effective: objects created for one program can be used in another.
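
To illustrate working with more than one object of the same class, the sketch below reuses a trimmed-down version of myClass (without the printing constructor) and creates two objects. The names x, y, and "AB" are only examples. Each object keeps its own name, while the class-level default stays untouched.

```python
class myClass:
    name = ""  # class-level default, shared until an object sets its own name

    def assignName(self, name):
        self.name = name  # stored on this particular object only

    def myFunc(self):
        return "hi " + self.name + ": hello world!"


x = myClass()
y = myClass()
x.assignName("YB")
y.assignName("AB")

print(x.myFunc())
print(y.myFunc())
print(repr(myClass.name))  # the class-level default is still the empty string
```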

Usage of the Loop Statements

Loops can help you to execute a block of code repeatedly. There are two types of loops in Python: (1) for and (2) while.

The for Loop

A for loop iterates over the items of a sequence such as a list. Here is an example. Run the code:

primes = [2, 3, 5, 7]
for prime in primes:
    print(prime)

A for loop can iterate over a sequence of numbers using the "range" function. In Python 3, range returns a range object that produces the numbers of the specified range one at a time, rather than building a list. Note that the range function is zero-based. Run the following examples to understand how the for loop iterates using range.

for x in range(5):
    print(x)
for x in range(6, 10):
    print(x)

Also, we can design a for loop statement using the range and len built-in functions. For example, the above example of printing the prime numbers can be also written in this way:

primes = [2, 3, 5, 7]
for index in range(len(primes)):
    print(primes[index])

Run the above code.

The While Loop

A while loop repeats as long as a certain condition is met. For example:

n = 0
while n < 5:
    print(n)
    n += 1

Run the above code.

If necessary, we can use break to exit a for loop or a while loop. On the other hand, continue skips the rest of the current iteration and moves on to the next one. For example:

n = 0
while True:
    print(n)
    n += 1
    if n >= 5:
        break

The above example behaves the same as the following code, which uses continue:

n = 0
while True:
    print(n)
    n += 1
    if n >= 5:
        break
    else:
        continue

Run the above two pieces of the code.
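
In the example above, continue has no visible effect because it is the last statement in the loop body. The following sketch, with an example list named odds, shows a case where continue and break each change what the loop does.

```python
odds = []
for n in range(10):
    if n % 2 == 0:
        continue        # even number: skip the rest of this iteration
    if n > 7:
        break           # leave the loop entirely
    odds.append(n)      # reached only for odd numbers up to 7

print(odds)
```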

Useful built-in functions and defining lambda functions

In this section, we will learn some of the useful built-in functions in Python.

abs

abs() returns the absolute value of a number. Run the code:

n = -21
print(abs(n))

dict

We've already learned how to create dictionaries in the above activity. dict() creates a new dictionary, and there are several ways to call it. If no arguments are given, an empty dictionary is created. We also saw that dict() accepts keyword arguments (e.g. dict(a=1, b=2)); in addition, it accepts a list or tuple of (key, value) pairs as its argument.

It is helpful to think of a dictionary as a collection of two-element (key, value) pairs.
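
The different ways of calling dict() can be compared directly. In this sketch the variable names are only examples; it also shows that items() turns a dictionary back into (key, value) pairs.

```python
pairs = [('a', 1), ('b', 2)]

empty = dict()                             # no arguments: an empty dictionary
from_pairs = dict(pairs)                   # built from a list of (key, value) tuples
from_keywords = dict(a=1, b=2)             # built from keyword arguments
back_to_pairs = list(from_pairs.items())   # and back to a list of pairs

print(empty, from_pairs, from_keywords, back_to_pairs)
```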

enumerate

We can use enumerate() to iterate over an iterable object. It returns an enumerate object. More specifically, such an object can be seen as a sequence of tuples, each containing a pair of count/index and value. So this function is very useful when we need both the index and the value of each item in a list.

To learn what this function returns and how to use it, run the following code:

menu = ['pizza', 'pasta', 'hamburger']
print(menu)
print(list(menu))
print(list(enumerate(menu)))

Using the for loop, we can iterate over the enumerate object. Look at and run the code:

for index, item in enumerate(menu):
    print (index, item)

If you want to change the start index, for example, starting with 1, use the code below:

for index, item in enumerate(menu, 1):
    print (index, item)

len

We've already looked into how to use len(). It returns the length (i.e. the number of items) of an object.

min, max

max() returns the maximum value in a given list while min() returns the minimum. Run the code:

a = [1, 3, 5]
print(max(a))
print (min(a))

str

If you want to convert an integer value into a string, use str(). Run the code:

n = 10
str(n)

range

We've already looked at how to use the range() function with a for loop. It generates a sequence of numbers to iterate over with a for loop.

The range() function can be called in two forms. Let's look at how this function can be used in each case.

  1. range(stop): stop is an integer, and the function produces the numbers from 0 up to, but not including, stop. That is, list(range(3)) is [0, 1, 2]. Run the code:
     for i in range(5):
         print(i)
  2. range(start, stop[, step]): start is the first number of the sequence, stop is the same as above, and step is the interval between consecutive numbers in the sequence. Note that step is optional and defaults to 1. Run the code:

     for i in range(1, 5): # start, stop
         print(i)
     for i in range(1, 5, 2): # start, stop, step
         print(i)
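
Because range() produces its numbers lazily, wrapping it in list() is a convenient way to inspect a whole sequence at once. The sketch below does this for both forms and adds a negative step, which counts downwards. The variable names are only examples.

```python
stop_only = list(range(5))           # 0 up to, but not including, 5
start_stop = list(range(1, 5))       # 1 up to, but not including, 5
with_step = list(range(1, 5, 2))     # every second number starting from 1
countdown = list(range(5, 0, -1))    # a negative step counts down

print(stop_only, start_stop, with_step, countdown)
```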

sorted, sort

Using the sorted() function, we can easily sort a list in ascending order. Run the code:

a = [3, 1, 7, 5, 9]
sorted(a)

If you want to generate a list sorted in descending order, pass the parameter reverse=True to the function:

sorted(a, reverse=True)

We can also use the list.sort() method, which sorts the list in place instead of returning a new list. Run the code:

a.sort()
a
a.sort(reverse=True)
a

Now, let's see how to sort a dictionary in Python. We've learned that the dict object is a useful container that can store a collection of key-value pairs. As an example, look at the following dictionary: myDic = {'a':1, 'b':3, 'c':2}

In the myDic object, the keys are a, b, and c, while the corresponding values are 1, 3, and 2. By calling the list function on it, we can easily retrieve the keys. Run the code:

myDic = {'a':1, 'b':3, 'c':2} 
list(myDic)

The keys are returned in the order in which they were inserted, which is not necessarily sorted. If we want to order the dictionary by its keys, we can use sorted(). Run the code:

print(sorted(myDic))
print(sorted(list(myDic)))

The above two results should be the same. Can you identify why?

On the other hand, if we want the values of the dictionary in sorted order, we can use the following example:

print(sorted(myDic.values()))

Finally, if we want to iterate over the dictionary items sorted by key, see the following example:

for key, value in sorted(myDic.items()):
    print(key, value)
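
Sorting the values alone loses track of which key each value belongs to. One possible way to keep the pairs together, sketched below, is to sort the items with a key function that picks out the value. It uses a lambda function, which is introduced later in this activity, and the variable names are only examples.

```python
myDic = {'a': 1, 'b': 3, 'c': 2}

# Sort the (key, value) pairs by their second element, the value.
by_value = sorted(myDic.items(), key=lambda pair: pair[1])
keys_by_value = [key for key, _ in by_value]

# The same idea in descending order.
by_value_desc = sorted(myDic.items(), key=lambda pair: pair[1], reverse=True)

print(by_value, keys_by_value, by_value_desc)
```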

zip

This zip() function returns an iterator that aggregates elements from each of the iterables. For example, if we want to aggregate elements from two lists, we can use this method. Look at and run the following example:

A = [1, 2, 3]
B = ['a', 'b', 'c']
C = zip(A, B)
print (list(C))

If the lengths of the iterables are not equal, zip stops at the shortest iterable, so the number of tuples equals the length of the shortest one. For example:

A = [1, 2, 3]
B = ['a', 'b']
C = zip(A, B)
print (list(C))

What's the result? The resulting list should contain 2 tuples.

Do you want to unzip a list of tuples? Don't worry! We can do it:

C = list(zip(A,B))
newA, newB = zip(*C)
print (newA, newB)
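
Two common uses of zip() are building a dictionary from two lists and looping over two lists in step. The sketch below shows both; the prices are made-up values used only for illustration.

```python
names = ['pizza', 'pasta', 'hamburger']
prices = [12, 10, 8]   # hypothetical prices, for illustration only

price_of = dict(zip(names, prices))   # pair each name with its price
lines = [name + ": " + str(price) for name, price in zip(names, prices)]

print(price_of)
print(lines)
```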

Lambda functions

Lambda functions are anonymous functions (i.e. functions that are not bound to a name) created at runtime, and we can create them using the keyword "lambda".

The following code shows the difference between a normal function ("f") using def and a lambda function ("g"):

def f(x): 
    return 2**x
print (f(3))

g = lambda x: 2**x
print (g(3))

Run the above code.

Can you identify differences between f() and g()? As you can see, both functions do exactly the same thing and can be used in the same way. However, note that g() does not include a "return" statement. Also, you can put a lambda definition anywhere a function is expected, and you don't need to assign it to a variable at all.

Let's see another example that takes this a step further. Run the code:

A = [2, 18, 9, 22, 17]
print (list(filter(lambda x: x % 3 == 0, A)))

In the above, we used a built-in function, filter(), and passed it a lambda function that does a specific job as its first argument. Of course, we can instead define a normal function using def and pass it to filter(), which is preferable if we're going to use it several times or if the function is too complex to write as a single expression.

However, if we need it only once and it looks simple, it could be better to use a lambda function. This creates very compact, yet readable code. Run the above code and check the result.
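
Lambda functions work the same way with other built-in functions that take a function as an argument. The sketch below compares a lambda with an equivalent def-based function (is_multiple_of_three is a name chosen for this example) and also uses map() and the key parameter of sorted().

```python
A = [2, 18, 9, 22, 17]

def is_multiple_of_three(x):
    return x % 3 == 0

with_lambda = list(filter(lambda x: x % 3 == 0, A))
with_def = list(filter(is_multiple_of_three, A))    # same result as the lambda

doubled = list(map(lambda x: 2 * x, A))             # apply a function to every item
by_last_digit = sorted(A, key=lambda x: x % 10)     # sort by the last digit

print(with_lambda, with_def, doubled, by_last_digit)
```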

Multi-processing functions

Multi-processing is a core part of parallel programming. In parallel programming, multiple processors work separately to complete a given job. The job is split into a number of sub-tasks, and each processor is responsible for carrying out one sub-task using its own separate memory.

The multiprocessing module in Python's standard library has powerful features. If you want to read about all the tips and details, please refer to the following: https://docs.python.org/dev/library/multiprocessing.html.

In the following, we provide a brief overview of using the Pool approach that we will use throughout the semester for parallel data processing. The Pool class is used to represent a pool of worker processes. It has methods allowing us to offload a given job to the worker processes.

There are three methods that are particularly interesting:

- Pool.apply()
- Pool.map()
- Pool.apply_async()

Let's get started!

First, we need to import the Python multiprocessing module:

import multiprocessing as mp

Second, we need to create a Pool object by defining the number of worker processes that will run at the same time for parallel processing. That is, we create an instance of Pool and tell it to create n_processor worker processes:

pool = mp.Pool(processes = n_processor)

Third, we call the pool.apply() method to run functionName in a worker process:

pool.apply(functionName, [argument_1, ..., argument_n])

Here, functionName is the function to be run in a worker process, and argument_1, ..., argument_n are the arguments of functionName.

To explain, let's look at the example and run it:

import multiprocessing as mp

def cube(x):
    return x**3

pool = mp.Pool(processes = 2)

results = [pool.apply(cube, [x]) for x in range (1,5)]
print(results)

The above example shows how to calculate cube numbers using a pool of two worker processes. Each pool.apply call is executed by one of the two workers and blocks the main program until its result is ready, so the calls are processed one after another. It's useful if we want to obtain results in a particular order for a given application.

We can also use the pool.map() method, which splits an iterable into chunks and applies a function to them across the worker processes in parallel, returning the results in order. In the above example, "results = [pool.apply(cube, [x]) for x in range (1,5)]" can be written as follows using pool.map():

results = pool.map(cube, range(1,5))

Run the code above!

In contrast to pool.apply() and pool.map(), the pool.apply_async() method does not block: each call returns immediately with a result object, so all tasks can be submitted at once and run in parallel. We then need to call the get() method on each result object to obtain the results, which waits until that result is ready. Let's look at and run the following example:

results = [pool.apply_async(cube, [x]) for x in range (1,5)]
output = [p.get() for p in results]
print(output)
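
When the same idea is moved from a notebook into a standalone script, two habits are worth considering: releasing the pool when the work is done, and guarding the entry point. The sketch below is one possible arrangement, not the only one. The helper parallel_cubes is a hypothetical name introduced here, and the guard matters on platforms where worker processes are started by re-importing the script.

```python
import multiprocessing as mp


def cube(x):
    return x ** 3


def parallel_cubes(values, n_workers=2):
    """Hypothetical helper: cube each value using a pool of worker processes."""
    # The with-block shuts the pool down once the results have been collected.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(cube, values)


# The guard keeps worker processes from re-running this part when they
# import the script (needed where workers are started with "spawn").
if __name__ == "__main__":
    print(parallel_cubes(range(1, 5)))  # cubes of 1 to 4, in input order
```

Wrapping the pool in a helper function keeps its lifetime short, at the cost of creating a new pool on every call.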

Congratulations on finishing this activity!

Having practiced today's activities, we're now ready to embark on the rest of the exciting FIT5202 activities! See you next week!